Annual income could be estimated from the booked price and the number of bookings. However, Airbnb provides no booking data, so we approximate income using the average price and an estimated number of bookings, where the number of bookings is inferred from 'minimum nights per booking' and 'number of reviews per month'.
Annual income may be underestimated for some listings: minimum_nights can be shorter than the actual stay, and the review count understates bookings because not all guests leave a review after their stay.
The average price may not represent average booking income well, because some listings have high price variance throughout the year.
Some listings are outliers of these distributions and may not reflect the hosts' actual data.
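The estimate described above (average price × nights per booking × bookings per month × 12) can be sketched on toy data; all numbers below are made up for illustration:

```python
import pandas as pd

# Toy listings illustrating the income estimate (values are invented):
listings = pd.DataFrame({
    'avg_price': [60.0, 120.0],        # average nightly price
    'minimum_nights': [2, 4],          # proxy for nights per booking
    'reviews_per_month': [1.5, 0.5],   # proxy for bookings per month
})

# annual income ≈ price * nights per booking * bookings per month * 12 months
listings['annual_income'] = (listings.avg_price
                             * listings.minimum_nights
                             * listings.reviews_per_month * 12)
print(listings.annual_income.tolist())  # [2160.0, 2880.0]
```

Because reviews undercount bookings, this is a lower-bound style estimate rather than a true income figure.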
from IPython.display import Image
Image(filename='img/a1.png',width=800, height=600)
Image(filename='img/a2.png',width=600, height=400)
Image(filename='img/a3.png',width=600, height=400)
Image(filename='img/a4.png',width=600, height=400)
Image(filename='img/a5.png',width=600, height=400)
Image(filename='img/a6.png',width=600, height=400)
Image(filename='img/a7.png',width=600, height=400)
import numpy as np
import pandas as pd
# read datasets
berlin = 'Berlin Airbnb Dataset'
calendar = pd.read_csv(berlin+"/calendar.csv")
listings = pd.read_csv(berlin+"/listings.csv")
reviews = pd.read_csv(berlin+"/reviews.csv")
print('calendar dataset date range:', calendar.date.min(), calendar.date.max())
print('review dataset date range:', reviews.date.min(), reviews.date.max())
import re
prices = calendar.dropna().copy() # filter out null values; copy to avoid SettingWithCopyWarning
prices['price_numeric'] = prices.price.apply(lambda x: float(re.sub('[$,]', '', x))) # str -> numeric
price_by_listing = prices.groupby('listing_id').agg({'price_numeric':['mean', 'max', 'min', 'std']})['price_numeric'].reset_index()
price_by_listing.head()
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
price_by_listing['std'].fillna(0).hist(bins=100)
plt.yscale('log')
plt.show()
avg_price = price_by_listing[price_by_listing['std']<200][['listing_id','mean']]
print(avg_price['mean'].describe())
avg_price.head()
from datetime import datetime
reviews['date'] = pd.to_datetime(reviews.date)
reviews_filtered = reviews[reviews['date'] > pd.to_datetime('2017-11-07')]
n_reviews = reviews_filtered.groupby(['listing_id']).size()
print(n_reviews.shape[0])
listings.columns
listings_df = listings[['id', 'host_id','latitude', 'longitude','property_type','room_type','accommodates','minimum_nights', 'reviews_per_month', 'price']].copy() # copy to avoid SettingWithCopyWarning
listings_df['price'] = listings_df.price.apply(lambda x: float(re.sub('[$,]', '', x))) # str -> numeric
print(listings_df.shape)
print(listings_df.isna().sum())
listings_df.head()
NaN_avg_price = pd.merge(listings_df, avg_price, how='left', left_on='id', right_on='listing_id')['mean'].isna()
price = listings_df[NaN_avg_price][['id','price']].rename(columns={'id':'listing_id', 'price':'mean'})
all_price = pd.concat([avg_price, price]).rename(columns={'mean':'avg_price'})
p1_df = pd.merge(listings_df, all_price, how='left', left_on='id', right_on='listing_id')
p1_df.head()
p1_df.minimum_nights.hist(bins=100)
plt.title('minimum_nights distribution')
plt.xlabel('minimum nights')
plt.yscale('log')
plt.show()
print(p1_df.minimum_nights.describe())
bplot = sns.boxplot(x=p1_df[p1_df['minimum_nights']<90].minimum_nights, width=0.5, palette="colorblind")
bplot.set_title('minimum nights distribution after filtered <90')
print(p1_df[p1_df['minimum_nights']<90].minimum_nights.describe())
p1_df['min_nights_new'] = p1_df['minimum_nights'].apply(lambda x: x if x<5 else 5)
problem_1 = p1_df.fillna(0).drop(['price','minimum_nights','listing_id'], axis=1)
problem_1['annual_income']= problem_1.avg_price * problem_1.min_nights_new * problem_1.reviews_per_month * 12
problem_1.head()
income_by_host = problem_1.groupby('host_id').agg({'annual_income':'sum'}).reset_index()\
.rename(columns={'annual_income':'annual_income_by_host'})
income_by_host.sort_values(by=['annual_income_by_host'],ascending=False).head()
print(income_by_host.annual_income_by_host.describe())
income_by_host.annual_income_by_host.hist(bins=200)
plt.title('Annual income distribution by host')
plt.xlabel('annual income')
plt.yscale('log')
plt.show()
problem_1a = pd.merge(problem_1, income_by_host, how='left', on='host_id')
problem_1a.head()
problem_1a.to_csv("problem1a.csv", index=False) # save as csv for tableau visualisation
from datetime import datetime
calendar['date'] = pd.to_datetime(calendar.date)
calendar['price'] = calendar.price.fillna('-1') # NaN -> -1
calendar['price'] = calendar.price.apply(lambda x: float(re.sub('[$,]', '', x))) # str -> numeric
calendar.groupby('available')['listing_id'].count()
calendar.to_csv('problem1b.csv')
Image(filename='img/b1.png',width=600, height=400)
Image(filename='img/b2.png',width=600, height=400)
Some listings show extremely high prices.
Design a normalized listing ranking system which would allow users seeking Airbnb services (or even the Airbnb marketplace team) to get a comparative holistic view of the Berlin listings. For the ranking, please consider a holistic approach (it cannot be a univariate model). For example: you can consider host attributes, the listing's price, distance to important locations, etc.
Location: proximity to tourist attractions (tourism type), restaurants, and bus/train stations, with all three weighted highest.
ranking_features = listings[['id','host_id','host_response_time', 'host_response_rate',
'host_is_superhost','host_has_profile_pic', 'host_identity_verified',
'number_of_reviews','review_scores_rating',
'review_scores_accuracy', 'review_scores_cleanliness',
'review_scores_checkin', 'review_scores_communication',
'review_scores_location', 'review_scores_value']]
ranking_features.head()
print(ranking_features.shape)
ranking_features.isna().sum()
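One way to turn features like these into a normalized holistic score is min-max scaling followed by a weighted sum. The sketch below uses a tiny hypothetical feature table and illustrative weights, not the actual ranking model:

```python
import pandas as pd

# Hypothetical feature table (values and weights are illustrative only):
df = pd.DataFrame({
    'review_scores_rating': [95.0, 80.0, 100.0],
    'number_of_reviews':    [10,   200,  50],
    'host_is_superhost':    [1,    0,    1],
})

weights = {'review_scores_rating': 0.5,
           'number_of_reviews':    0.3,
           'host_is_superhost':    0.2}

# Min-max normalize each feature to [0, 1] so scales are comparable,
# then combine into a single score with a weighted sum.
normed = (df - df.min()) / (df.max() - df.min())
df['rank_score'] = sum(normed[c] * w for c, w in weights.items())
print(df.rank_score.round(3).tolist())  # [0.575, 0.3, 0.763]
```

Features with missing values would need imputation (or a reduced weight) before scaling, and features where lower is better would be inverted first.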
However, the Airbnb data contains approximately 1.35 times more listings in London than in Paris.
Therefore, although the Airbnb data looks related to the number of tourists in these two cities, we cannot draw conclusions from this small part of the dataset.
london = pd.read_csv('Comparative Market Analysis Dataset/London 05 June 2019.csv')
paris = pd.read_csv('Comparative Market Analysis Dataset/Paris 05 June 2019.csv')
print('number of listings in London: ',london.shape[0])
print('number of listings in Paris: ', paris.shape[0])
paris.columns[0:50], paris.columns[50:107]
london_guests = london[['guests_included','reviews_per_month']].copy() # copy to avoid SettingWithCopyWarning
london_guests['n_guests'] = london.guests_included * london.reviews_per_month * 12
paris_guests = paris[['guests_included','reviews_per_month']].copy()
paris_guests['n_guests'] = paris.guests_included * paris.reviews_per_month * 12
print('Estimated number of guests of Airbnb in London: ', round(london_guests.n_guests.sum(), 2))
print('Estimated number of guests of Airbnb in Paris: ', round(paris_guests.n_guests.sum(), 2))
print('Airbnb guests - London > Paris: ', round(london_guests.n_guests.sum()/paris_guests.n_guests.sum(), 2))
print('Tourists - London > Paris: ', round(20000000/18000000, 2))
print('However, more Airbnb listings in London than Paris: ', round(london.shape[0]/paris.shape[0], 2))
According to expatistan [3], London is 20% more expensive than Paris. What do you think from the Airbnb dataset? (Be careful with the assumptions; make sure to validate them before performing hypothesis testing.)
Prices on Airbnb could reflect accommodation prices in a city, so we can compare listing prices between the two cities.
Based on the result, prices in London are more spread out than in Paris: London has many lower-priced accommodations but also some much higher-priced ones, while prices in Paris are closer to normally distributed.
Therefore, we cannot conclude that London is more expensive than Paris.
london_price = london.price.apply(lambda x: float(re.sub('[$,]', '', x))) # str -> numeric
paris_price = paris.price.apply(lambda x: float(re.sub('[$,]', '', x))) # str -> numeric
print('London prices: \n',london_price.describe())
print('\nParis prices: \n',paris_price.describe())
sns.distplot(london_price,hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 2}, label = 'London')
sns.distplot(paris_price, hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 2}, label = 'Paris')
sns.distplot(london_price[london_price<=135],hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 2}, label = 'London') \
.set_title('Price distribution in London vs Paris')
sns.distplot(paris_price[paris_price<=120], hist = False, kde = True,
kde_kws = {'shade': True, 'linewidth': 2}, label = 'Paris')
sns.boxplot(data=[london_price[london_price<=135], paris_price[paris_price<=120]], width=0.5)\
.set_title('Prices in London vs Paris')
plt.xticks([0,1], ['London', 'Paris'])
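Since the price distributions are right-skewed, a non-parametric test such as the Mann-Whitney U test would be safer than a t-test for comparing the two cities. The sketch below uses synthetic lognormal samples standing in for the real `london_price` and `paris_price` series:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic right-skewed "prices"; in the notebook these would be the
# london_price and paris_price series computed above.
london_like = rng.lognormal(mean=4.1, sigma=0.6, size=2000)
paris_like = rng.lognormal(mean=4.0, sigma=0.4, size=2000)

# Non-parametric test: are the two price distributions different?
stat, p = mannwhitneyu(london_like, paris_like, alternative='two-sided')
print(round(p, 4))
```

A significant p-value would only indicate the distributions differ, not that one city is uniformly more expensive, which matches the caution above.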
Image(filename='img/3-1.png',width=600, height=400)
Image(filename='img/3-2.png',width=600, height=400)
# London
london_property_type = london.property_type.value_counts().reset_index()\
.rename(columns={'property_type':'London','index':'property_type'})
london_property_type['London%'] = round(london_property_type.London/london.shape[0]*100,4)
# Paris
paris_property_type = paris.property_type.value_counts().reset_index()\
.rename(columns={'property_type':'Paris','index':'property_type'})
paris_property_type['Paris%'] = round(paris_property_type.Paris/paris.shape[0]*100, 4)
# Merge
property_type = pd.merge(london_property_type, paris_property_type, how='outer', on='property_type')
property_type.head(10)
property_type.to_csv('problem3c.csv', index=False)
# London
london_room_type = london.room_type.value_counts().reset_index()\
.rename(columns={'room_type':'London','index':'room_type'})
london_room_type['London%'] = round(london_room_type.London/london.shape[0]*100,4)
# Paris
paris_room_type = paris.room_type.value_counts().reset_index()\
.rename(columns={'room_type':'Paris','index':'room_type'})
paris_room_type['Paris%'] = round(paris_room_type.Paris/paris.shape[0]*100, 4)
# Merge
room_type = pd.merge(london_room_type, paris_room_type, how='outer', on='room_type')
room_type.head(10)
room_type.to_csv('problem3c_2.csv', index=False)
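To check whether the room-type mix genuinely differs between the two cities (rather than by sampling noise), a chi-square test of independence on the count table could be used. The counts below are invented placeholders for the real `london.room_type` / `paris.room_type` tallies:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical room-type counts (made-up numbers; the real counts come
# from london.room_type.value_counts() and paris.room_type.value_counts()):
counts = pd.DataFrame({'London': [45000, 38000, 1200],
                       'Paris':  [55000, 8000, 1500]},
                      index=['Entire home/apt', 'Private room', 'Shared room'])

# Chi-square test of independence: do room-type proportions
# differ between the two cities?
chi2, p, dof, expected = chi2_contingency(counts.values)
print(dof, p < 0.05)  # 2 True
```

With sample sizes this large, even small differences in proportions come out significant, so the effect size (the percentage columns computed above) matters more than the p-value alone.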